{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/embeddings/nomic.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Nomic Embedding\n",
    "\n",
    "Nomic has released v1.5 🪆🪆🪆 is capable of variable sized embeddings with matryoshka learning and an 8192 context, embedding dimensions between 64 and 768.\n",
    "\n",
    "In this notebook, we will explore using Nomic v1.5 embedding at different dimensions."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U llama-index llama-index-embeddings-nomic"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup API Keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nomic_api_key = \"<NOMIC API KEY>\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "from llama_index.embeddings.nomic import NomicEmbedding"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### With dimension at 128"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embed_model = NomicEmbedding(\n",
    "    api_key=nomic_api_key,\n",
    "    dimensionality=128,\n",
    "    model_name=\"nomic-embed-text-v1.5\",\n",
    ")\n",
    "\n",
    "embedding = embed_model.get_text_embedding(\"Nomic Embeddings\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "128\n"
     ]
    }
   ],
   "source": [
    "print(len(embedding))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.05569458, 0.057922363, -0.30126953, -0.09832764, 0.05947876]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding[:5]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### With dimension at 256"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embed_model = NomicEmbedding(\n",
    "    api_key=nomic_api_key,\n",
    "    dimensionality=256,\n",
    "    model_name=\"nomic-embed-text-v1.5\",\n",
    ")\n",
    "\n",
    "embedding = embed_model.get_text_embedding(\"Nomic Embeddings\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "256\n"
     ]
    }
   ],
   "source": [
    "print(len(embedding))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.044708252, 0.04650879, -0.24182129, -0.07897949, 0.04776001]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding[:5]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### With dimension at 768"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embed_model = NomicEmbedding(\n",
    "    api_key=nomic_api_key,\n",
    "    dimensionality=768,\n",
    "    model_name=\"nomic-embed-text-v1.5\",\n",
    ")\n",
    "\n",
    "embedding = embed_model.get_text_embedding(\"Nomic Embeddings\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "768\n"
     ]
    }
   ],
   "source": [
    "print(len(embedding))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.027282715, 0.028381348, -0.14758301, -0.048187256, 0.029144287]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding[:5]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### You can still use v1 Nomic Embeddings\n",
    "\n",
    "It has 768 fixed embedding dimensions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embed_model = NomicEmbedding(\n",
    "    api_key=nomic_api_key, model_name=\"nomic-embed-text-v1\"\n",
    ")\n",
    "\n",
    "embedding = embed_model.get_text_embedding(\"Nomic Embeddings\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "768\n"
     ]
    }
   ],
   "source": [
    "print(len(embedding))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.0059013367, 0.03744507, 0.0035305023, -0.047180176, 0.0154418945]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding[:5]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's Build end to end RAG pipeline with Nomic v1.5 Embedding.\n",
    "\n",
    "We will use OpenAI for Generation step."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Set Embedding model and llm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import settings\n",
    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"<YOUR OPENAI API KEY>\"\n",
    "\n",
    "embed_model = NomicEmbedding(\n",
    "    api_key=nomic_api_key,\n",
    "    dimensionality=128,\n",
    "    model_name=\"nomic-embed-text-v1.5\",\n",
    ")\n",
    "\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo\")\n",
    "\n",
    "settings.llm = llm\n",
    "settings.embed_model = embed_model"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-02-16 18:37:03--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 75042 (73K) [text/plain]\n",
      "Saving to: 'data/paul_graham/paul_graham_essay.txt'\n",
      "\n",
      "data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.02s   \n",
      "\n",
      "2024-02-16 18:37:03 (3.87 MB/s) - 'data/paul_graham/paul_graham_essay.txt' saved [75042/75042]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = SimpleDirectoryReader(\"./data/paul_graham\").load_data()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Index creation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorStoreIndex.from_documents(documents)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Query Engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The author, growing up, worked on writing and programming. They wrote short stories and also tried writing programs on an IBM 1401 computer. Later, they got a microcomputer and started programming more extensively, writing simple games and a word processor.\n"
     ]
    }
   ],
   "source": [
    "response = query_engine.query(\"what did author do growing up?\")\n",
    "print(response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  },
  "vscode": {
   "interpreter": {
    "hash": "b1d2a638b53f4d7129cb7686d8e3b97ae1d80a593a1618479f60cef5591ea888"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
